課程背景與深度學習可重現性危機

隨著我們從簡單、自包含的模型轉向里程碑專案1所需的複雜多階段架構，以試算表或本地檔案手動追蹤關鍵參數已完全不可持續。這種複雜的工作流程會對開發完整性帶來嚴重風險。

1. 識別重現性的瓶頸

深度學習的工作流程本質上因眾多變數（優化演算法、資料子集、正則化技術、環境差異）而具有高度變異性。若無系統性追蹤，重現特定過去結果——這對於除錯或改進已部署模型至關重要——往往不可能實現。

哪些內容必須被追蹤？

超參數： All configuration settings must be recorded (e.g., Learning Rate, Batch Size, Optimizer choice, Activation function).

環境狀態： Software dependencies, hardware used (GPU type, OS), and exact package versions must be fixed and recorded.

成果與結果： Pointers to the saved model weights, final metrics (Loss, Accuracy, F1 score), and training runtime must be stored.

The "Single Source of Truth" (SSOT)

Systematic experiment tracking establishes a central repository—a SSOT—where every choice made during model training is recorded automatically. This eliminates guesswork and ensures reliable auditability across all experimental runs.

終端機bash — tracking-env

> 已準備就緒。點選「執行概念性追蹤」以查看工作流程。

實驗追蹤 Live

Simulate the run to visualize the trace data captured.

問題 1

深度學習可重現性危機的根本原因為何？

PyTorch 對 CUDA 驅動程式的依賴。

未被追蹤的變數數量龐大（程式碼、資料、超參數與環境）。

大型模型過度消耗記憶體。

生成成果的計算成本。

問題 2

在 MLOps 的背景下，為什麼系統化實驗追蹤對生產環境至關重要？

它能最小化模型成果的總體儲存空間。

它確保能可靠地重建並部署達成報告績效的模型。

它能加速模型的訓練階段。

問題 3

哪一項元素是重現結果所必需的，卻最常在手動追蹤中被遺漏？

執行的訓練週期數。

所有 Python 套件的具體版本與所使用的隨機種子。

所使用的資料集名稱。

訓練開始的時間。

挑戰：過渡期的追蹤

為什麼過渡到正式追蹤是不容妥協的。

You are managing 5 developers working on Milestone Project 1. Each developer reports their best model accuracy (88% to 91%) in Slack. No one can reliably tell you the exact combination of parameters or code used for the winning run.

第一步

必須立即實施哪一步驟，以阻止關鍵資訊的流失？

解答：
Implement a mandatory requirement for every run to be registered with an automated tracking system before results are shared, capturing the full hyperparameter dictionary and Git hash.

第二步

結構化追蹤能為團隊帶來哪些共享試算表無法提供的好處？

解答：
Structured tracking allows automated comparison dashboards, visualizations of parameter importance, and centralized artifact storage, which is impossible with static spreadsheets.